By: Lamia Alsalloom
In this section, we invert the generator by solving a nonconvex optimization problem in the latent space of a pretrained StyleGAN model using the w+ representation. The results below are obtained by applying different combinations of loss functions during inversion: an Lp (L1) loss that enforces pixel-level similarity, a perceptual loss that preserves high-level features from a pretrained network, and an L2 regularization term on the latent update (delta) that constrains the optimization. The outputs show how each combination affects reconstruction quality, highlighting the gains in image detail and background fidelity obtained with the perceptual loss and regularization.
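As a reference for how these terms are combined, the sketch below shows one way to assemble the inversion objective. The weight values and the perceptual distance perc_loss_fn (e.g., an LPIPS or VGG feature distance) are placeholders, not the exact settings from our script.

import torch
import torch.nn.functional as F

def inversion_loss(generated, target, delta, perc_loss_fn,
                   l1_weight=10.0, perc_weight=0.01, reg_weight=0.001):
    """Composite inversion objective: pixel (L1) + perceptual + L2 penalty on delta.
    perc_loss_fn is a placeholder for whatever perceptual distance is used."""
    loss = l1_weight * F.l1_loss(generated, target)               # pixel-level similarity
    loss = loss + perc_weight * perc_loss_fn(generated, target)   # high-level feature similarity
    loss = loss + reg_weight * delta.pow(2).mean()                # keep the latent update small
    return loss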
1. Various combinations of the losses, including the Lp loss, perceptual loss, and/or a regularization loss that penalizes the L2 norm of delta:
Original Image
L1 pixel loss weight: 0
Perceptual loss weight: 0.01
Regularization loss weight: 0.001
L1 pixel loss weight: 0
Perceptual loss weight: 0
Regularization loss weight: 0.001
L1 pixel loss weight: 1
Perceptual loss weight: 1
Regularization loss weight: 0.1
L1 pixel loss weight: 10
Perceptual loss weight: 1
Regularization loss weight: 0.1
L1 pixel loss weight: 10
Perceptual loss weight: 0
Regularization loss weight: 0
L1 pixel loss weight: 0
Perceptual loss weight: 0
Regularization loss weight: 0.1
L1 pixel loss weight: 0.1
Perceptual loss weight: 0.01
Regularization loss weight: 0.001
2. Different generative models, including the vanilla GAN and StyleGAN:
The following results are obtained by optimizing a combination of L1 and perceptual losses over 1000 iterations in the z latent space. All experiments are run on an A100 GPU. Our experiments show that although the vanilla GAN is computationally efficient (~40 s) thanks to its streamlined architecture, its reconstructions are noticeably less detailed and realistic than the higher-fidelity outputs produced by StyleGAN (~47 s). A sketch of the optimization loop follows the example images below.
Original Image
Vanilla GAN
StyleGAN
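For concreteness, here is a minimal sketch of the latent-code optimization loop, assuming a generator G that maps a latent code directly to an image and reusing the inversion_loss helper sketched earlier; the optimizer, learning rate, and initialization are illustrative rather than the exact settings used.

import torch

def invert(G, target, perc_loss_fn, num_iters=1000, lr=0.01):
    """Optimize a latent code so that G(latent) reconstructs the target image."""
    latent_init = torch.randn(1, G.z_dim, device=target.device)  # assumes G exposes z_dim
    delta = torch.zeros_like(latent_init, requires_grad=True)    # optimize an offset from the init
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(num_iters):
        optimizer.zero_grad()
        generated = G(latent_init + delta)
        loss = inversion_loss(generated, target, delta, perc_loss_fn)
        loss.backward()
        optimizer.step()
    return latent_init + delta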
3. Different latent spaces (latent code in z space, w space, and w+ space):
The following results are obtained by using different latent spaces in the StyleGAN model, while optimizing a combination of L1 and perceptual losses over 1000 iterations. All experiments are run on an A100 GPU, and each took roughly 47 s.
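As a rough illustration of how the three latent spaces differ, the snippet below sketches how a w code can be derived from z through the mapping network and broadcast into a per-layer w+ code. The attribute names (mapping, z_dim, num_ws) follow common StyleGAN2 implementations and may differ from the code actually used.

import torch

def sample_latents(G, batch_size=1, device="cuda"):
    """Sample a z code and derive w / w+ codes from it (illustrative only)."""
    z = torch.randn(batch_size, G.z_dim, device=device)  # z space: Gaussian prior
    w = G.mapping(z, None)[:, 0, :]                       # w space: output of the mapping network
    w_plus = w.unsqueeze(1).repeat(1, G.num_ws, 1)        # w+ space: one w per synthesis layer
    return z, w, w_plus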
4. Give comments on why the various outputs look how they do. Which combination gives you the best result and how fast your method performs:
- When using the z latent space for StyleGAN, the optimization becomes particularly challenging because gradients must pass through the additional mapping network, often leading to reconstructions that remain very close to the initial state rather than converging to the input.
- Both the w and w+ latent spaces provide improved reconstruction quality; the w+ space delivers greater detail and background fidelity owing to its increased expressiveness and flexibility.
- Experiments show that vanilla GAN inversions take about 40 seconds per image, while StyleGAN-based inversions require roughly 47 seconds per image.
- The results indicate that GAN outputs tend to be unstable and that using the perceptual loss is essential for generating images that closely match the reference, whereas regularizing the latent update (delta) has little effect.
- The best performance in our experiments is achieved using StyleGAN with either the w or w+ latent space, as these capture the input image more accurately than the z space, and StyleGAN outperforms the vanilla GAN for inversion tasks.
- The best result is obtained using StyleGAN with the w+ space, an Lp loss weight of 10, a perceptual loss weight of 0.01, and a regularization loss weight of 0.0. The method completes 1000 iterations in about 47 s.
I drew this cat
I drew this cat
Show some example outputs of your guided image synthesis on at least 2 different input images:
Grumpy cat reimagined as a royal painting
Grumpy cat reimagined as a royal painting
Show some example outputs of your guided image synthesis on 2 different amounts of noise added to the input:
A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style
A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style
A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style
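The outputs above use the same prompt with different amounts of noise added to the input. As a rough sketch of what varying the noise amount means in a DDPM-style forward process, the snippet below diffuses an input image up to a chosen timestep; alphas_cumprod and the timestep values are assumptions about the underlying noise schedule, not the exact code used.

import torch

def add_noise(x0, t, alphas_cumprod):
    """Diffuse a clean image x0 to timestep t: larger t means more noise and more deviation from the input."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]  # cumulative product of (1 - beta) up to step t
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Example (hypothetical timesteps): noising the same sketch to two different levels
# x_light = add_noise(sketch, t=300, alphas_cumprod=schedule)  # stays closer to the sketch
# x_heavy = add_noise(sketch, t=600, alphas_cumprod=schedule)  # gives the model more freedom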
Show some example outputs of your guided image synthesis on 2 different classifier-free guidance strength values:
For this part, we fix the number of steps to 700 and use the same prompt on multiple sketch images with varying values of guidance strength.
Prompt: A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style.
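For reference, classifier-free guidance mixes the conditional and unconditional noise predictions at each denoising step. The sketch below shows the standard formulation; model, cond_emb, and uncond_emb are placeholder names rather than the actual interface used here.

import torch

def cfg_noise_prediction(model, x_t, t, cond_emb, uncond_emb, guidance_scale):
    """Classifier-free guidance: push the prediction toward the conditional direction.
    guidance_scale = 1.0 recovers the purely conditional prediction; larger values
    follow the prompt more strongly at the cost of fidelity to the noised input."""
    eps_cond = model(x_t, t, cond_emb)      # noise predicted with the text condition
    eps_uncond = model(x_t, t, uncond_emb)  # noise predicted with an empty/null condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)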
Interpolate between two latent codes in the GAN model, and generate an image sequence (2pt):
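A minimal sketch of the interpolation, assuming a generator G and two latent codes z1 and z2; linear interpolation is shown, though spherical interpolation is a common alternative for Gaussian latents.

import torch

def interpolate_latents(G, z1, z2, num_frames=10):
    """Generate an image sequence by linearly interpolating between two latent codes."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, num_frames):
        z = (1.0 - alpha) * z1 + alpha * z2  # linear interpolation in latent space
        frames.append(G(z))
    return frames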
Implement additional types of constraints (3 pts each), e.g., the sketch/shape and warping constraints mentioned in the iGAN paper, or a texture constraint using a style loss:
(1) Sketch/shape constraint: We implemented an extra edge constraint based on the Sobel operator. This sketch/shape constraint computes edge maps from both the generated image and the reference sketch, then applies an L1 loss between these edge maps. This helps enforce that the generated image retains the structural details, such as outlines and edges, present in the input sketch.
0_data.png
0_mask.png
0_250.png
0_500.png
0_750.png
0_1000.png
The additional edge constraint nudges the generator to align with the edges of the scribble, preserving crucial contours and outlines in the final image so that it better reflects the structure of the input sketch.
import torch
import torch.nn.functional as F

def compute_sobel_edges(img):
    """
    Compute an approximate edge map using the Sobel operator.
    img: Tensor of shape (B, C, H, W) with values in [0, 1].
    Returns a tensor of shape (B, 1, H, W) representing edge magnitudes.
    """
    sobel_x = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]], device=img.device).view(1, 1, 3, 3)
    sobel_y = torch.tensor([[-1., -2., -1.],
                            [ 0.,  0.,  0.],
                            [ 1.,  2.,  1.]], device=img.device).view(1, 1, 3, 3)
    gray = img.mean(dim=1, keepdim=True)           # convert to single-channel intensity
    edge_x = F.conv2d(gray, sobel_x, padding=1)    # horizontal gradient
    edge_y = F.conv2d(gray, sobel_y, padding=1)    # vertical gradient
    edges = torch.sqrt(edge_x ** 2 + edge_y ** 2)  # gradient magnitude
    return edges

def edge_loss(generated, sketch):
    """
    Compute an L1 loss between the edge maps of the generated image and the input sketch.
    generated, sketch: Tensors of shape (B, C, H, W) with values in [0, 1].
    """
    gen_edges = compute_sobel_edges(generated)
    sketch_edges = compute_sobel_edges(sketch)
    return F.l1_loss(gen_edges, sketch_edges)
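The edge term is then weighted into the overall objective alongside the other losses. The helper below is only an illustration of that combination; base_loss_fn and edge_weight are hypothetical names and values.

def total_loss_with_edges(generated, target, sketch, base_loss_fn, edge_weight=0.5):
    """Combine a base reconstruction loss with the Sobel edge constraint (illustrative weights)."""
    return base_loss_fn(generated, target) + edge_weight * edge_loss(generated, sketch)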
(2) Texture loss: The texture loss uses a style loss computed via Gram matrices of features extracted from a pretrained network to capture and enforce similar texture patterns between the generated image and a reference texture image. In effect, it encourages the generated image to adopt the local texture statistics of the provided texture reference, such as color distributions and patterns, improving the overall style consistency of the output.
4_data.png
4_mask.png
4_250.png
4_500.png
4_1000.png
Reference texture image
By applying a high texture weight (1.0) alongside minimal edge and shape weights, the generated images strongly incorporate patterns from the zebra-like reference texture, causing the cat's fur and background to adopt striping and bold textural elements. With lower edge/shape constraints, the sketch outlines have less influence on structure, allowing the texture to dominate the final style.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class TextureLoss(nn.Module):
    def __init__(self, layers=(3, 8, 17, 26)):  # example VGG19 layer indices
        super(TextureLoss, self).__init__()
        # Use a pretrained VGG19 feature extractor with frozen weights.
        self.vgg = vgg19(pretrained=True).features.eval()
        for param in self.vgg.parameters():
            param.requires_grad = False
        self.layers = set(layers)

    def gram_matrix(self, features):
        # Gram matrix of the feature maps, normalized by their size.
        B, C, H, W = features.size()
        features = features.view(B, C, H * W)
        gram = torch.bmm(features, features.transpose(1, 2))
        return gram / (C * H * W)

    def forward(self, generated, texture):
        loss = 0.0
        x = generated
        y = texture
        gen_features = []
        tex_features = []
        # Run both images through VGG19 and collect features at the chosen layers.
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            y = layer(y)
            if i in self.layers:
                gen_features.append(x)
                tex_features.append(y)
        # Style loss: MSE between the Gram matrices at each selected layer.
        for gf, tf in zip(gen_features, tex_features):
            loss += F.mse_loss(self.gram_matrix(gf), self.gram_matrix(tf))
        return loss
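A brief usage sketch, assuming the inputs are image tensors of shape (B, 3, H, W) normalized as expected by VGG19; the tensor shapes and weight value below are hypothetical.

import torch

# Illustrative usage of the texture constraint
texture_criterion = TextureLoss()
generated = torch.rand(1, 3, 256, 256)    # stand-in for the generator output
texture_ref = torch.rand(1, 3, 256, 256)  # stand-in for the reference texture image
texture_weight = 1.0
loss = texture_weight * texture_criterion(generated, texture_ref)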